We are running three classifiers and blending their results:
mix_lbp.csv contains all the features used for the experiments (as well as for the final competition submission), one row per sample of the training set. For this demo, we split this set 9:1: 90% is used for training and 10% for validation.
We then run three different classifiers on these sets (training on the larger one and validating on the smaller one) and blend their predictions by simple weighted voting, in order to decrease the resulting log loss (computed on the validation set).
We are training with two types of features, all of which are contained in this file.
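The blend itself is conceptually simple: a weighted average of the per-class probability matrices each classifier produces. Here is a minimal sketch of the idea; the actual implementation is the vote() function imported from tr_utils below, and vote_sketch is just a hypothetical stand-in:

import numpy as np

# Toy weighted "soft voting": a weighted average of the per-class probability
# matrices, renormalized so each row sums to 1 again. Illustration only,
# the real logic lives in tr_utils.vote.
def vote_sketch(probas, weights):
    blended = sum(w * p for w, p in zip(weights, probas))
    return blended / blended.sum(axis = 1)[:, None]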
In [1]:
from SupervisedLearning import SKSupervisedLearning
from train_files import TrainFiles
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import log_loss, confusion_matrix
from sklearn.calibration import CalibratedClassifierCV
from tr_utils import vote
import matplotlib.pylab as plt
import numpy as np
from train_nn import createDataSets, train
Let's define a function that plots the confusion matrix, so we can see how accurate our predictions really are.
In [2]:
def plot_confusion(sl):
    conf_mat = confusion_matrix(sl.Y_test, sl.clf.predict(sl.X_test_scaled)).astype(dtype='float')
    # normalize each row so it sums to 1
    norm_conf_mat = conf_mat / conf_mat.sum(axis = 1)[:, None]

    fig = plt.figure()
    plt.clf()
    ax = fig.add_subplot(111)
    ax.set_aspect(1)
    res = ax.imshow(norm_conf_mat, cmap=plt.cm.jet, interpolation='nearest')
    cb = fig.colorbar(res)

    # class labels start at 1, tick positions at 0
    labs = np.unique(sl.Y_test)
    x = labs - 1
    plt.xticks(x, labs)
    plt.yticks(x, labs)

    # annotate each cell with its percentage
    for i in x:
        for j in x:
            ax.text(i - 0.2, j + 0.2, "{:3.0f}".format(norm_conf_mat[j, i] * 100.))
    return conf_mat
In [3]:
train_path_mix = "./mix_lbp.csv"
labels_file = "./trainLabels.csv"
X, Y_train, Xt, Y_test = TrainFiles.from_csv(train_path_mix, test_size = 0.1)
The last line above reads the features file and splits it 9:1 into a training set (X, Y_train) and a held-out validation set (Xt, Y_test).
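If you do not have TrainFiles handy, a roughly equivalent split can be done with sklearn directly. This is only a sketch: it assumes the features come first and the class label sits in the last column of the CSV, which may not match the actual file layout.

import numpy as np
from sklearn.cross_validation import train_test_split

# load the CSV and do the same 9:1 split (label-in-last-column is an assumption)
data = np.loadtxt(train_path_mix, delimiter = ",")
X, Xt, Y_train, Y_test = train_test_split(data[:, :-1], data[:, -1], test_size = 0.1)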
Training consists of fitting three models: an SVM, a feed-forward neural network, and a calibrated random forest.
We neatly wrap this into our $\color{green}{SKSupervisedLearning}$ class. The procedure is simple:
In [4]:
sl = SKSupervisedLearning(SVC, X, Y_train, Xt, Y_test)
sl.fit_standard_scaler()
sl.train_params = {'C': 100, 'gamma': 0.01, 'probability' : True}
ll_trn, ll_tst = sl.fit_and_validate()
print "SVC log loss: ", ll_tst
You can play with the parameters here to see how log loss changes. SKSupervisedLearning wraps the sklearn grid search for optimal parameters in a single call; you can take a look at the implementation details.
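What such a search boils down to is a hedged sketch like the following, using sklearn's GridSearchCV directly (the grid values are illustrative, not what was actually searched):

from sklearn.grid_search import GridSearchCV

# illustrative grid search over the SVC parameters used above
param_grid = {'C': [1, 10, 100, 1000], 'gamma': [0.1, 0.01, 0.001]}
gs = GridSearchCV(SVC(probability = True), param_grid, scoring = 'log_loss', cv = 3)
gs.fit(sl.X_train_scaled, Y_train)
print "best parameters: ", gs.best_params_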
Let's plot the confusion matrix to see how well we are doing (the values inside the squares are percentages). Change the magic below to %matplotlib qt to get an out-of-browser graph.
In [5]:
%matplotlib inline
conf_svm = plot_confusion(sl)
As expected, we are not doing so well in class 5, where there are very few samples.
This is a fun one, I promise. :)
The neural net is built with PyBrain and has just one hidden layer, sized at $\frac{1}{4}$ of the input layer. The hidden layer activation is sigmoid, the output activation is softmax (since this is a multi-class net), and there are bias units for the hidden and output layers. We use the PyBrain $\color{green}{buildNetwork()}$ function, which builds the network in one call.
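Something like the following sketch would build the architecture just described (this is not the actual $\color{green}{train\_nn}$ code, whose exact arguments may differ):

from pybrain.tools.shortcuts import buildNetwork
from pybrain.structure import SigmoidLayer, SoftmaxLayer

# one sigmoid hidden layer of 1/4 the input size, softmax output, bias units
n_in = sl.X_train_scaled.shape[1]
n_out = len(np.unique(Y_train))
net = buildNetwork(n_in, n_in / 4, n_out,
                   hiddenclass = SigmoidLayer, outclass = SoftmaxLayer, bias = True)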
NOTE: We are still using all the scaled features to train the neural net.
I am setting %matplotlib to qt so training can be watched in real time. You will see each training epoch charted: the graph on the left shows the % error, the one on the right shows the log loss.
You can play with the "test error" or "epochs" parameters to control how long training runs; we limit it to just 10 epochs for this experiment.
In [6]:
%matplotlib qt
trndata, tstdata = createDataSets(sl.X_train_scaled, Y_train, sl.X_test_scaled, Y_test)
fnn = train(trndata, tstdata, epochs = 10, test_error = 0.07, momentum = 0.15, weight_decay = 0.0001)
Finally, we train the random forest (which happens to train in seconds) wrapped in the calibration classifier (which takes two hours or so).
Random forests are very accurate, but they make over-confident predictions (or at least that is what the predict_proba function, which is supposed to return the probability of each class, gives us). So, god forbid we are ever wrong: once a probability of 0 is predicted for the correct class, log loss shoots to infinity. The calibration classifier makes predict_proba return something sane.
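A toy example of the arithmetic: sklearn clips probabilities at $10^{-15}$, so a hard zero on the true class costs about $-\ln(10^{-15}) \approx 34.5$ for that one sample, swamping everything else.

from sklearn.metrics import log_loss

# one over-confident miss dominates the average log loss
y_true = [0, 1]
confident = [[1.0, 0.0], [1.0, 0.0]]  # second sample: probability 0 on the true class
hedged    = [[0.9, 0.1], [0.6, 0.4]]  # wrong on the second sample too, but hedged
print "confident: ", log_loss(y_true, confident)  # ~17.3 (34.5 spread over 2 samples)
print "hedged:    ", log_loss(y_true, hedged)     # ~0.51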
In [8]:
sl_ccrf = SKSupervisedLearning(CalibratedClassifierCV, X, Y_train, Xt, Y_test)
sl_ccrf.train_params = \
{'base_estimator': RandomForestClassifier(**{'n_estimators' : 7500, 'max_depth' : 200}), 'cv': 10}
sl_ccrf.fit_standard_scaler()
ll_ccrf_trn, ll_ccrf_tst = sl_ccrf.fit_and_validate()
print "Calibrated log loss: ", ll_ccrf_tst
As you can see, we are simply wrapping the $\color{green}{RandomForestClassifier}$ in the $\color{green}{CalibratedClassifierCV}$. Plot the matrix (after a couple of hours):
In [10]:
%matplotlib inline
conf_ccrf = plot_confusion(sl_ccrf)
In [11]:
%matplotlib inline
x = 1. / np.arange(1., 6)
y = 1 - x
xx, yy = np.meshgrid(x, y)

lls1 = np.zeros(xx.shape[0] * yy.shape[0]).reshape(xx.shape[0], yy.shape[0])
lls2 = np.zeros(xx.shape[0] * yy.shape[0]).reshape(xx.shape[0], yy.shape[0])

# blend the SVM and RF probabilities with every pair of weights
for i, x_ in enumerate(x):
    for j, y_ in enumerate(y):
        proba = vote([sl.proba_test, sl_ccrf.proba_test], [x_, y_])
        lls1[i, j] = log_loss(Y_test, proba)

        proba = vote([sl.proba_test, sl_ccrf.proba_test], [y_, x_])
        lls2[i, j] = log_loss(Y_test, proba)

fig = plt.figure()
plt.clf()
ax = fig.add_subplot(121)
ax1 = fig.add_subplot(122)
ax.set_aspect(1)
ax1.set_aspect(1)
res = ax.imshow(lls1, cmap=plt.cm.jet, interpolation='nearest')
res = ax1.imshow(lls2, cmap=plt.cm.jet, interpolation='nearest')
cb = fig.colorbar(res)
The graphs show the "blended" log loss: the matrix on the left blends the SVM and RF probabilities with weights "favoring" the SVM, and the one on the right "favors" the RF.
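To read the best blend off these grids, recall that lls1[i, j] was computed with weights (x[i], y[j]) and lls2[i, j] with weights (y[j], x[i]):

# locate the weight pair with the lowest blended log loss in each grid
i, j = np.unravel_index(np.argmin(lls1), lls1.shape)
print "SVM-leaning best: ", lls1[i, j], " weights (SVM, RF): ", (x[i], y[j])
i, j = np.unravel_index(np.argmin(lls2), lls2.shape)
print "RF-leaning best:  ", lls2[i, j], " weights (SVM, RF): ", (y[j], x[i])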